Learning from Data Streams: An Overview and Update
The literature on machine learning in the context of data streams is vast and
growing. However, many of the defining assumptions regarding data-stream
learning tasks are too strong to hold in practice, or are even contradictory
such that they cannot be met in the contexts of supervised learning. Algorithms
are chosen and designed based on criteria which are often not clearly stated,
for problem settings not clearly defined, tested in unrealistic settings,
and/or in isolation from related approaches in the wider literature. This puts
into question the potential for real-world impact of many approaches conceived
in such contexts, and risks propagating a misguided research focus. We propose
to tackle these issues by reformulating the fundamental definitions and
settings of supervised data-stream learning with regard to contemporary
considerations of concept drift and temporal dependence; we take a fresh
look at what constitutes a supervised data-stream learning task, and we
reconsider the algorithms that may be applied to tackle such tasks. Drawing
on this formulation and overview, and helped by an informal
survey of industrial players dealing with real-world data streams, we provide
recommendations. Our main emphasis is that learning from data streams does not
impose a single-pass or online-learning approach, or any particular learning
regime; and any constraints on memory and time are not specific to streaming.
Meanwhile, there exist established techniques for dealing with temporal
dependence and concept drift, in other areas of the literature. For the data
streams community, we thus encourage a shift in research focus, from dealing
with often-artificial constraints and assumptions on the learning mode, to
issues such as robustness, privacy, and interpretability which are increasingly
relevant to learning in data streams in academic and industrial settings.
Evaluation methods and decision theory for classification of streaming data with temporal dependence
Predictive modeling on data streams plays an important role in modern data analysis, where data arrives continuously and needs to be mined in real time. In the stream setting the data distribution often evolves over time, and models that update themselves during operation are becoming the state of the art. This paper formalizes a learning and evaluation scheme for such predictive models. We theoretically analyze the evaluation of classifiers on streaming data with temporal dependence. Our findings suggest that the commonly accepted data stream classification measures, such as classification accuracy and the Kappa statistic, fail to diagnose cases of poor performance when temporal dependence is present; they should therefore not be used as sole performance indicators. Moreover, classification accuracy can be misleading if used as a proxy for evaluating change detectors on datasets that have temporal dependence. We formulate the decision theory for streaming data classification with temporal dependence and develop a new evaluation methodology for data stream classification that takes temporal dependence into account. We propose a combined measure of classification performance that takes temporal dependence into account, and we recommend using it as the main performance measure in classification of streaming data.
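The point about accuracy and Kappa can be illustrated with a small sketch. It compares a classifier against a no-change baseline that simply repeats the previous true label. The function names, the toy labels, and the kappa-style statistic against the no-change baseline are assumptions made for illustration; this is not presented as the combined measure proposed in the paper.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def kappa(y_true, y_pred):
    """Cohen's kappa: accuracy corrected for chance agreement."""
    n = len(y_true)
    p0 = accuracy(y_true, y_pred)
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    pc = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (p0 - pc) / (1 - pc) if pc < 1 else 0.0

def persistent_predictions(y_true, first_guess):
    """No-change baseline: predict the previous true label."""
    return [first_guess] + list(y_true[:-1])

def kappa_vs_persistent(y_true, y_pred, first_guess):
    """Kappa-style correction using the no-change baseline as reference."""
    p0 = accuracy(y_true, y_pred)
    p_per = accuracy(y_true, persistent_predictions(y_true, first_guess))
    return (p0 - p_per) / (1 - p_per) if p_per < 1 else 0.0

# Hypothetical stream with strong temporal dependence (labels arrive in runs).
y_true = [0] * 5 + [1] * 5 + [0] * 5 + [1] * 5
# Hypothetical classifier that errs once inside every run.
y_pred = list(y_true)
for i in (2, 7, 12, 17):
    y_pred[i] = 1 - y_pred[i]

acc = accuracy(y_true, y_pred)
kap = kappa(y_true, y_pred)
kap_per = kappa_vs_persistent(y_true, y_pred, first_guess=0)
print(acc, kap, kap_per)
```

On this toy stream, accuracy and Kappa both look acceptable, while the statistic against the no-change baseline is negative: the classifier does worse than repeating the last label.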
The NOW Database of Fossil Mammals
This chapter was completed and accepted after revision in August 2021. NOW does not have dedicated institutional funding. The database and data development are funded from regular research projects of the NOW Community members. Current and recent (last 5 years) funding sources include the Ella and Georg Ehrnrooth Foundation and the Academy of Finland. ICP researchers acknowledge funding from the Generalitat de Catalunya (CERCA Programme), R+D+I projects PID2020-117289GB-I00 and PID2020-116908GB-I00 (MCIN/AEI/10.13039/501100011033/), and the consolidated research group 2022 SGR 00620 of the Generalitat de Catalunya. This is Bernor’s NSF FuTRES publication 35. L. K. Säilä acknowledges Academy of Finland Postdoctoral grant 275551. We thank three reviewers for helpful suggestions regarding the manuscript text. Contributions from the Valio Armas Korvenkontio Unit of Dental Anatomy in Relation to Evolutionary Theory are acknowledged.
NOW (New and Old Worlds) is a global database of fossil mammal occurrences, currently containing around 68,000 locality-species entries. The database spans the last 66 million years, with its primary focus on the last 23 million years. Whereas the database contains records from all continents, the main focus and coverage of the database historically has been on Eurasia. The database includes primarily, but not exclusively, terrestrial mammals. It covers a large part of the currently known mammalian fossil record, focusing on classical and actively researched fossil localities. The database is managed in collaboration with an international advisory board of experts. Rather than a static archive, it emphasizes the continuous integration of new knowledge of the community, data curation, and consistency of scientific interpretations. The database records species occurrences at localities worldwide, as well as ecological characteristics of fossil species, geological contexts of localities and more. The NOW database is primarily used for two purposes: (1) queries about occurrences of particular taxa, their characteristics and properties of localities in the spirit of an encyclopedia; and (2) large-scale research and quantitative analyses of evolutionary processes, patterns, reconstructing past environments, as well as interpreting evolutionary contexts. The data are fully open; no logging in or community membership is necessary for using the data for any purpose.
Clustering based active learning for evolving data streams
Data labeling is an expensive and time-consuming task. Choosing which labels to acquire is becoming increasingly important. In the active learning setting, a classifier is trained by asking for labels for only a small fraction of all instances. While many works deal with this issue in non-streaming scenarios, few address the data stream setting. In this paper we propose a new active learning approach for evolving data streams based on a pre-clustering step for selecting the most informative instances for labeling. We consider a batch-incremental setting: when a new batch arrives, we first cluster the examples and then select the best instances to train the learner. The clustering approach makes it possible to cover the whole data space and avoids oversampling examples from only a few areas. We compare our method with state-of-the-art active learning strategies on real datasets. The results highlight the improvement in performance of our proposal. Experiments on parameter sensitivity are also reported.
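The cluster-then-select idea for one batch can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' method: the tiny k-means, the per-cluster budget, and the `uncertainty` scores (standing in for whatever informativeness criterion is used) are all hypothetical.

```python
import math

def kmeans(points, k, iters=10):
    """Tiny k-means with deterministic initialisation (the first k points)."""
    centroids = [points[i] for i in range(k)]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assign every point to its nearest centroid.
        assign = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        # Move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return assign

def select_for_labeling(points, uncertainty, k, per_cluster):
    """Pick the most uncertain instances separately within each cluster."""
    assign = kmeans(points, k)
    chosen = []
    for j in range(k):
        idx = [i for i, a in enumerate(assign) if a == j]
        idx.sort(key=lambda i: uncertainty[i], reverse=True)
        chosen.extend(idx[:per_cluster])
    return sorted(chosen)

# Hypothetical batch with two well-separated groups of instances.
batch = [(0.0, 0.0), (5.0, 5.0), (0.1, 0.2), (0.2, 0.1), (5.1, 4.9), (4.9, 5.2)]
# Hypothetical uncertainty scores from the current classifier.
uncertainty = [0.1, 0.2, 0.9, 0.8, 0.3, 0.7]

clustered_choice = select_for_labeling(batch, uncertainty, k=2, per_cluster=1)
global_choice = sorted(sorted(range(len(batch)),
                              key=lambda i: uncertainty[i], reverse=True)[:2])
print(clustered_choice, global_choice)
```

With the same labeling budget of two instances, the global ranking takes both from one region of the data space, whereas the clustered selection takes one from each.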
Multi-output regression with structurally incomplete target labels: A case study of modelling global vegetation cover
Publisher Copyright: © 2022 The Authors. Weakly-supervised learning has recently emerged in the classification context where true labels are often scarce or unreliable. However, this learning setting has not yet been extensively analyzed for regression problems, which are typical in macroecology. We further define a novel computational setting of structurally noisy and incomplete target labels, which arises, for example, when the multi-output regression task defines a distribution such that outputs must sum up to unity. We propose an algorithmic approach to reduce noise in the target labels and improve predictions. We evaluate this setting with a case study in global vegetation modelling, which involves building a model to predict the distribution of vegetation cover from climatic conditions based on global remote sensing data. We compare the performance of the proposed approach to several incomplete target baselines. The results indicate that the error in the targets can be reduced by our proposed partial-imputation algorithm. We conclude that handling structural incompleteness in the target labels instead of using only complete observations for training helps to better capture global associations between vegetation and climate. Peer reviewed.
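The sum-to-unity constraint mentioned above has one simple consequence that can be sketched in code: when exactly one output of a target vector is missing, it is determined by the others. The function below is a hypothetical illustration of that observation only; it is not the partial-imputation algorithm of the paper, and the name `fill_from_unit_sum` and the example cover fractions are assumptions.

```python
def fill_from_unit_sum(target):
    """Fill a single missing output (None) so that the outputs sum to one.

    Targets with zero or several missing outputs are returned unchanged,
    because the constraint alone does not determine them.
    """
    missing = [i for i, v in enumerate(target) if v is None]
    if len(missing) != 1:
        return list(target)
    observed_sum = sum(v for v in target if v is not None)
    filled = list(target)
    # Clip at zero in case noisy observations already sum to more than one.
    filled[missing[0]] = max(0.0, 1.0 - observed_sum)
    return filled

# Hypothetical cover fractions for three vegetation classes.
one_missing = fill_from_unit_sum([0.6, None, 0.1])
two_missing = fill_from_unit_sum([0.5, None, None])
print(one_missing, two_missing)
```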
Concept drift over geological times: predictive modeling baselines for analyzing the mammalian fossil record
Fossils are the remains of organisms from earlier geological periods preserved in sedimentary rock. The global fossil record documents and characterizes the evidence about organisms that existed at different times and places during the Earth's history. One of the major directions in computational analysis of such data is to reconstruct environmental conditions and track climate changes over millions of years. The distribution of fossil animals in space and time makes informative features for such modeling, yet concept drift presents one of the main computational challenges. As species continuously go extinct and new species originate, animal communities today are different from the communities of the past, and the communities at different times in the past are different from each other. The fossil record is continuously increasing as new fossils and localities are being discovered, but it is not possible to observe or measure their environmental contexts directly, because that time has passed. Labeled data linking organisms to climate is available only for the present day, where climatic conditions can be measured. The approach is to train models on the present day and use them to predict climatic conditions over the past. But since species representation is continuously changing, transfer learning approaches are needed to make models applicable and climate estimates comparable across geological times. Here we discuss predictive modeling settings for such paleoclimate reconstruction from the fossil record. We compare and experimentally analyze three baseline approaches for predictive paleoclimate reconstruction: (1) averaging over habitats of species, (2) using presence-absence of species as features, and (3) using functional characteristics of species communities as features.
Our experiments on present-day African data and a case study on fossil data from the Turkana Basin over the last 7 million years suggest that presence-absence approaches are the most accurate over short time horizons, while species community approaches, also known as ecometrics, are the most informative over longer time horizons when, due to ongoing evolution, taxonomic relations between the present-day and fossil species become more and more uncertain. Peer reviewed.
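Baseline (1), averaging over habitats of species, can be sketched as below. This is a minimal reading of the idea under stated assumptions: the data layout, the function names, the species lists, and the climate values are all hypothetical, and the actual baseline in the study may differ in detail.

```python
def species_habitat_means(sites):
    """Average the climate value over all present-day sites of each species.

    sites: list of (set_of_species, climate_value) pairs.
    """
    totals, counts = {}, {}
    for species, value in sites:
        for s in species:
            totals[s] = totals.get(s, 0.0) + value
            counts[s] = counts.get(s, 0) + 1
    return {s: totals[s] / counts[s] for s in totals}

def predict_site(species, habitat_means):
    """Average the habitat means of the species that are known today."""
    known = [habitat_means[s] for s in species if s in habitat_means]
    return sum(known) / len(known) if known else None

# Hypothetical present-day sites with a measured climate value each.
present_day = [
    ({"zebra", "giraffe"}, 600.0),
    ({"zebra", "oryx"}, 200.0),
    ({"giraffe"}, 800.0),
]
means = species_habitat_means(present_day)

# Hypothetical fossil sites: extinct species contribute no information.
recent_site = predict_site({"zebra", "giraffe", "extinct_horse"}, means)
old_site = predict_site({"extinct_horse", "extinct_pig"}, means)
print(recent_site, old_site)
```

The second fossil site shows the concept drift problem in miniature: once no species at a site survives to the present day, this baseline has nothing to average.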
Adaptive Training Set Formation (Adaptyvus mokymo imties formavimas)
Nowadays, when the environment is changing rapidly and dynamically, there is a particular need for adaptive data mining methods. Spam filters, personalized recommender and marketing systems, network intrusion detection systems, and business prediction and decision support systems need to be regularly retrained to take into account the changing nature of the data. In stationary settings, the more data is at hand, the more accurate a model can be trained. In a changing environment, old data decreases accuracy. In such a case only a subset of the historical data might be selected to form a training set. For instance, the training window strategy uses only the newest historical instances. The thesis addresses adaptive data mining methods that are based on selective training set formation. It improves the training strategies under sudden, gradual and recurring concept drifts. Four adaptive training set formation algorithms are developed and experimentally validated; they increase the generalization performance of the base models under each of the three concept drift types. Experimental evaluation using generated and real data confirms improvement of the classification and prediction accuracies as compared to using all the historical data as well as selected existing adaptive learning algorithms from the recent literature. A tailored method for an industrial boiler application, which unifies several drift types, is developed.
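The training window strategy mentioned above can be sketched with a fixed-size window over the most recent instances. The mean predictor, the toy stream, and the window size are assumptions chosen to keep the illustration short; the thesis algorithms themselves are not reproduced here.

```python
from collections import deque

def windowed_mean_predictions(stream, window):
    """Predict each value as the mean of the last `window` observed values."""
    recent = deque(maxlen=window)  # old instances drop out automatically
    preds = []
    for y in stream:
        # Predict before seeing the true value, then add it to the window.
        preds.append(sum(recent) / len(recent) if recent else 0.0)
        recent.append(y)
    return preds

def total_abs_error(stream, preds):
    return sum(abs(y - p) for y, p in zip(stream, preds))

# Hypothetical stream with a sudden drift halfway through.
stream = [1.0] * 5 + [5.0] * 5

preds_window = windowed_mean_predictions(stream, window=2)
preds_all = windowed_mean_predictions(stream, window=len(stream))

err_window = total_abs_error(stream, preds_window)
err_all = total_abs_error(stream, preds_all)
print(err_window, err_all)
```

After the drift, the short window recovers within two steps, while the model trained on all historical data keeps being pulled toward the outdated values.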